2022 Meeting Notes
2022 Meeting Notes#
July 1 as first day!
Trying to pay attention to issues, slack, email & join in wherever.
Wears more than 1 hat, taking off hats as appropriate. Intent is to be generally available for most 2i2c things
Interested in ‘making research & education delightful’
Documentation, training, onboarding!
Nature of research & edu is that we keep having new cohorts of people who need to be onboarded!
Participating in project Pythia hackathon working on documentation (xarray, dask, etc)
Figuring out and joining different communities!
pangeo is a big one (one of the ones that 2i2c started to operationalize!)
Finding names of other communities we’re a part of! Finding their names, connecting on their communities, etc.
Registered for Scientific community management program (along with Sarah!)
Want to enable other community managers too! So they feel empowered to do the things they need todo
Right to Replicate is core value of 2i2c
Some experience in running helm, k8s, z2jh, etc.
Has been a while, but am going to go through 2i2c’s docs and see if it ‘works’!
Fill in holes as it’s needed, get a feel for product as it is
Run it on something like minikube to test it out
Figure out what a 3-6 month strategy for division of product & community would look like
Already collecting with Jim on the partnership & sales side of it
Where are the sources of truth? Other ones James hasn’t explored?
Some on google drive
Figuring out how to use team shared drives
Google offers it for free for non profits, but because we’re under CSS they might already have it
Chris is going to figure out a shared Google Drive setup.
Language and naming around our managed hub service#
New diagrams of our service model from a recent CZI grant: https://docs.google.com/presentation/d/12Xkpin0sNwcBygArFUSalkrgfONgaRrfUYokyfIWWh8/edit
What do you call the process of what we provide and communicate that to communities?
How do you bring in the concept of that shared responsibility so we don’t pigeonhole ourself as a SaaS?
‘Managed’ implies 2i2c does ‘everything’ which is not true - we run it collaboratively
As a user, being able to have control over the user experience without requiring that I leave.
Open source communities often run into these issues
Don’t want to create two “classes” of people
Need to create community processes to help people do this
Right to Participate is important
Super early draft: https://hackmd.io/Y5SBMxV7R6CMqzeTXgm5kA?edit
Words to describe this
Managed JupyterHub service was picked because cloud service was managed by us, so others won’t have to worry about it
We call it JupyterHub to give people the idea that it is an open source component, just using OSS tools
One proposal: “Collaborative Hub”
But feels like this refers to the artifact (e.g. “collaborative editing”) not the process.
Make up a new word? (a neologism)
Collaborative hub model can be part of the bullet points, can be a brand new thing.
Are we providing jupyterhub, and our product is jupyterhub? Or is it JupyterHub+ other stuff? not sure what to call it.
Constraints / guiding principles to define this service#
Key words / ideas
Unburdening - “We do work so you don’t have to”
Participatory / Collaborative - “We invite your participation at several levels if you wish”
Transparency - “All our work is done in the open”
Reciprocity - “We do most of our work via upstream improvements”
“Levels of Participation” and “Shared Responsibility” are both aspects of the service. Describe how we have a range of collaboration options.
Putting many concepts into a single name is probably not possible, so we should choose a name that is not misleading and explain the details with extra documentation.
Contribute to a pre-existing large multi-stakeholder community
Collaborative, participatory service
Erik: Transparent operations
Sarah: Unburdening (traditionally, this is ‘put in a postdoc who has no k8s knowledge’)
Chris: Participatory is an important one, the ways in which the service is trying to take inspiration from healthy inclusive OSS communities.
Notion there is OSS as a license, you can see all the code and you can fork it! But the dynamics of production are a single team of people (docutils) where they don’t care about increasing participation.
But participatory is something like JupyterHub itself, where there are pathways to getting in and increasing the set of people who are participating.
Tension between participatory and unburdening, as we’re asking for labor for other organizations.
Sarah: “Levels of participation”, which we do have.
We are running a managed kubernetes service & a collaborative hub service.
We don’t get specific k8s feature requests because they don’t know what they want but they do ask for specific hub features!
It’s our job to figure out if it’s a k8s feature or not.
Extra complicated to put all this in a single term as it’s a spectrum of variation of input as we go down our big stack
Damian: Trying to have all these complex concepts working at different levels in one name is impossible.
So maybe we just have a name for the thing we are offering. And a description of what we are offering, otherwise it’ll be a bit difficult to come up with a name.
We are providing JupyterHub+ other things, we are offering an opinionated way to deploy hubs on the cloud. This opinionated way is the product itself.
Chris: Signal boost communities compare services by end product (features, end product I get, etc) and 2i2c needs to sell that the process by which you get it is as important as the product you are selling itself. Not intuitive for a lot of communities that are used to the producer / consumer pigoenhole that tech tends to create. Process is important!
Damian: Right to participate, evolving and getting it written will help us fight the battles against University IT, etc. Will help us get into that conversation at the level where we don’t end up in the situation wrt columbia credits again.
Yuvi: “Upstream” and “Upstream contributions” is another important part of this. It is a differentiator between what we do and many other startups. We contribute to a pre-existing large multi-stakeholder community.
Discuss making our operation and dev items assignments being two people by default#
How do we managed inefficiencies in the pairment
e.g., time zones and bottlenecks
One ‘lead’ developer who can work on the project and a ‘companion’ person who can help iterate quickly
We did this with sarah & erik for the CI/CD revamp! Worked really great, showcased the pattern is pretty good.
Trying to make this more official
Damian is trying to make requests for new hubs to use pairs of people, as these are ‘must do’
Mismatch of tz makes things a bit inefficient.
Should come up with teams that might change composition, that collaborate together. But how to deal with tz?
We should adopt pairing in almost all our projects!
Chris: What are the 2/3 main problems we’re trying to resolve with pairing?
People start working on projects because they are assigned to work on those, but they are not getting feedback as soon as they are helpful. Lots of conversations / PRs open for days and days, but not because people don’t like to provide feedback. Need to have interactive discussion to push things forward. Pairing tries to force this.
We have a lot of complexity in our stack! Makes it easier to push forward because things can be complicated.
Prevent siloing our knowledge about the idiosynchrasies of the infrastructure and of skills in general.
Need to get credentials for James
Chris is championing the onboarding, Sarah has offered to help with the technical bits
Notetaker: Erik, Jim
Job hire update/decision making/discussion#
Brief description of current candidates
Describe suggestion to make an offer
Question: Any objections to making the offer?
Make an offer:
CH: meeting goal: to get a feeling one way or another to provide a job offer for a candidate.
Important that we document the process we used.
DA: We have consensus on hiring top candidate? +1s all around…
Next steps (CH)
offer position to top candidate. CH will send email soon!
do a budget analysis to look into other hires
Matlab as a service?#
Is this mis-aligned with our mission or values?
Erik: a wish for clarification of the goal, “intergration with matlab” is vague
Should we offer this support to OpenScapes?
Estimates on implementation cost?
JC: How to handle licensing?
This was very difficult when they did this at PIMS
Instead they offered an Octave kernel inside a JupyterHub
Maybe this is what we should do here.
what does this mean?
SG: Binder supports Octave.
costs associated with license management should be passed on to openscapes
Berkeley runs a hub with Matlab
It is complex to run because of the licensing
It does help bring users on.
Ryan Lovett is the one that runs it.
What should this cost?
Premium for “non-open” special-casing
Premium for extra work we’d have to do to support this.
Premium for the uncertainty in doing this given we haven’t done it before.
YP: Don’t have an ideological problem with this plan, Matlab is just part of their image.
ES: Who manages the Openscapes image? How does this incur additional work? SG: How to handle support tickets?
CH: One way to see how serious they are about this request is to make it cost a lot of money.
They should figure out the licensing costs
Not aligned with the open source vision
estimate a premium on extra special casing
uncertainty in this scenario so should be a premium cost
just charge given the other things we might prioritize higher.
DA: I am worried about the support piece of this plan.
YP: They can do what they want with their image. Key distinction: Openscapes can do it vs. 2i2c helps Openscapes do it.
[x] Jim to reply to the issue. Open tickets on support instead of inside the infrastructure repo. We can provide faster responses.
Support process updates#
Brief overview of process
Any concerns or feedback?
Any objections to trying this?
Action: Engineering team to review and provide feedback.
Metadocencia / Carpentries colab?#
JC: Callysto and Metadocencia have overlapping missions. 2i2c could catalyze collaboration among Callysto + Metadocencia + Carpentries. This has potential.
CH: a person was worried about commercial cloud use, how does 2i2c currently think about third-party clouds etc?
CH: Users worried about commercial clouds could come try at 2i2c, and by doing so, learn more about the deployment and then transition with the R2R.
SG: considers digital ocean as another cloud we could support
SG: partnership model involving training could be possible, to help them grow their capability of doing what 2i2c does in their own cloud
CH: a large amount of funding is potentially available
DA: supports that there is a need for non-commercial clouds
CH: asks DA to connect further
DA: is happy to connect
Damian as a project manager (20min)#
Proposal for Damian to serve as a 50% service manager as a trial run.
More background here: https://github.com/2i2c-org/team-compass/issues/398
This isn’t going to be a huge change to our practices
Initially the focus is on the practices we’ve already discussed/agreed to.
We can evolve it over time as well
I think this is a great idea and thank you to Damian!
Concern: would this disrupt Damian’s workflow? How can we balance his 50% time?
We should keep track of this over time to make sure it’s sustainable for Damian
We should also be supportive of Damian in this role, and be on the lookout for whether it seems unsustainable for him
What’s the overlap with the Product/Community Lead?
Product/Community Lead is more external-facing, and the Project Manager is more internal/team-facing.
Project Manager should oversee the team “process” and make sure we’re following it.
Support steward and team roles discussion (30min)#
We need to adjust the frequency of these so they don’t overlap
Suggestion to rotate support steward based on timezone rather than alphabetical (ref: github issue)
Is it realistic to have a dedicated operations person
Since we have a 2 person team at support, let’s make one of them the level-1 and the other one is level-2.
You are level-1 in the first two weeks, level-2 in the next two weeks.
level-1 would be the client-facing (you only redirect things) and level-2 would be the dedicated operation eng (still with support of the whole team, of course).
How would we choose who is in which role?
If one person is dedicated to operations issues, would this mean they don’t have capacity for development work during that time?
To a degree, this is already true. The Support Steward tends to feel like they need to fix the problem as well as do support.
This may not solve the geography problem. If you are on a European time zone
We need more engineering capacity on the west coast ultimately
What if some team-mates cannot perform these actions? e.g., non-engineer team members?
What about expectation setting? Can we tell communities we won’t guarantee response in < 24 hours?
If we quantified and tracked our “service level expectations” then it could make it easier to reduce stress associated w/ outages, and set expectations with other communities.
We need a definition of “uptime”
What are the “Key Performance Indicators” for service levels?
Server start-up times
Could we define specific times where we are in “ready mode” for a hub, time-box this so it is not indefinite, and ask community representatives to choose when they request this extra attention?
This would add some extra complexity to manage these processes.
But, we are already paying this penalty in “implicit” ways.
Chris proposal: we treat first-line support and first-line operations as two separate teams.
General agreement on this.
But still haven’t figured out the timezones issue
We really only care about the “major issues” when it comes to timezones”. Most things are not required to fix immediately.
This is where a pager can be helpful.
Incident Commander is a concept from SRE
When there is an outage, this person makes sure that it is resolved.
Used in “must be fixed immediately” situations.
Something like Pager Duty is a good tool to try out.
What are the 2-3 things we can do next to improve things?
Look at our support tickets so far, and define a “hierarchy” of ticket types / priority queue, and define an escalation pathway for each.
After we do this, next steps can be working out the details of how we actually address those priority levels.
Should we also prioritize monitoring infrastructure?
Yes, general agreement that this would be helpful.
Defining a metric for service reliability could be a part of this.
Team meeting length (10min)#
Is 90 minutes too-long for our monthly team meeting?
Let’s have a monthly enjoy/informal meeting (30m, NO tech stuff, opt-in, open agenda)
Let’s have a monthly engineering meeting (1h, GENERAL tech stuff, all hands, ONLY agenda)
General agreement that this would be nice.
Could we block off the 90 minutes but still try to keep our meetings to 60 minutes?
Kirstie’s lab does this and it works nicely.
Team meetings are 60 minutes, but reserve a time slot for 90 minutes, and use the last 30 minutes as a “break glass in case of emergency” time to discuss.
Also find a way to make team space for informal conversation and chats with one another, but not as a part of the monthly team meeting.
Team meeting agenda coordination (5min)#
Our current agenda is split between a Google Calendar, a GitHub issue, and this HackMD.
We’ve hit confusion about syncing information across them etc.
Let’s simplify with this structure:
Do not use a dedicated issue for these meetings. Remove the issue template for team meetings.
The 2i2c Team Calendar is the source of truth for this meeting. We expect team members to keep up-to-date w/ this calendar to know when meetings will happen.
Meeting calendar events have one link, which is to the meeting agenda/notes.
In the agenda/notes, we keep all relevant information about the meeting (links, roles, etc…but not dates).
General agreement on this!
Will the Project Manager lead the sprint planning meetings now?#
Yes - the Project Manager can lead this meeting.
The “meeting facilitator” can serve in a support role for this meeting (not sure exactly what this means but we can figure it out).
Meeting facilitator: @erik
Meeting timekeeper: @damian
Meeting recorder: @erik
New job will be posted this or next week. Feedback is encouraged! https://github.com/2i2c-org/2i2c-org.github.io/pull/93
Hub service growth plan#
Currently serving about 20-25 communities.
Large variability in size. Some as small as 2-3 (Catalyst Cooperative), some as large as 200-300 (Toronto)
Large variability in complexity. Some have vanilla data8 environments (community colleges), others have monumental Dockerfiles (Pangeo)
As we grow, we also create more work for ourselves.
At the end of our runway (say, 24 months), we need to have a sustainable model.
What’s the right way to intentionally grow so that we can refine our infrastructure and processes?
Phases of growth with fixed number of “slots” for communities in each phase.
Three major areas that we are working out
Infrastructure / tech
Process for operations and support
Sales and invoicing
For each area, define questions that we want to answer, or conditions that we want to meet, in order to trigger a new batch of communities that we’ll take on.
Write up a blog post about this plan and share it for feedback, even if we don’t intend on taking on any new communities for some time.
Questions to ask#
What should our target number of communities be? (for now, in 1 month, in 3 months?)
Currently we have around 15 basic hubs, 7 research hubs, 3 experimental hubs
We need different kinds of slots to help indicate the amount of work they involve:
The amount of usage
Customization to authenticators
Image complexity: entirely custom image, repo2docker based image
Hub adjustments: sidecar, gpu
Initial development efforts
Community knowledge: they offload us by avoiding issues in the first place, and being able to understand the complexity of what they need
What is in scope, what isn’t?
More standardized hubs -> we can support more communities. We need to have a grasp of what we aim for with regards to this.
Avoid one off features
Answer 1: Projecting into the feature is one way to answer
Sarah: very few new communities right now, because we are to handle a new hub type: binderhubs
What is our biggest source of risk?
If there were one thing you’d focus on to boost our capacity, what would it be?
What can constrain our growth?
unreliability in automation to deploy, it is a timesink and can also break your focus repeatedly while waiting
we have made experimental exceptions on what hubs should accomplish
Overviewing the experimental adjustments as hubs grow is a challenge
While we need to experiment to meet users needs, we need to work to bring back experimentational features to general features or have a practice for managing the experiemental features.
Some hubs tends to require more work than others, how do we handle such differences?
An engineer needs to gague the complexity of implementing new features and make a call if it is seen as sustainable to maintain.
We can be constrained by features only a subset of us understands how to maintain it:
“will this feature be something everybody needs to learn about in the end?”
We don’t want to constrain ourselves to the lowest common denominator of shared knowledge
lack of monitoring
Goals for the alpha#
The goal is a phased approach to growing, for that, we need to think more clearly about our growth goals and define them more explicitly.
Idea: to have a concept of capacity, and how we increase that over time.
How to improve team coordination or management#
Currently, we’re using our weekly check-ins + sprint planning + async issues to coordinate.
There are a few kinds of work that the engineering team is doing.
Development projects (often span multiple weeks, and self-driven)
Expected operations tasks (e.g., expected maintenance, setting up a hub, etc)
Operations and support tasks (aka, helping when something is wrong, more unpredictable and come with more urgency)
The last two things are tricky - they are more unpredictable, but also more important that we do them quickly and on schedule.
This is particularly important when interfacing with sales, since a sales person will make a commitment to running the hub by a certain date.
Ultimately, we need to be able to:
Give somebody a concrete date (or guaranteed turnaround time) when their issue is resolved / hub is deployed / etc.
Be confident in our team processes that we will definitely get it done by that date, or raise a flag to change plan.
Our current model assumes that Chris is keeping the context of what everybody is working on in his head, but this is not scalable.
Chris feels like he is dropping balls in guiding others, refining issues, etc.
Also hard because he doesn’t have the skillset of a cloud engineer.
A few questions and ideas:
How can we improve our team processes to reduce the uncertainty around finishing support / operations issues?
Our incomming requests are hard to quickly create an action point to finish. Its often back and forth.
First line of support, second line of support
First line are generalists, they answer questions coming from all areas and help elicit relevant background information for the specialists
Second line of support are specialists
The support steward acts as the first line of support
Training for first line of support!
Digital educational content?
Defining mechanisms on to ensure support tickets resolve, or perhaps having a dedicated role to drive operations
Synthesize the conversation in order to get a clear next step
Issue about team coordinator / manager: https://github.com/2i2c-org/team-compass/issues/382
Chris suggests to summarize growth plan and write a blog posts
Issues around alpha strategy: https://github.com/2i2c-org/team-compass/issues/192
New job posting for product/community lead#
Role will be focused on maximizing the impact for communities using our infrastructure
A combination of community impact, training, and helping guide our service development
They should focus on the infrastructure from the user’s perspective, not an engineer’s, but work with engineering
CH: Background consideration: What major things would help any community? Currently, we don’t have capacity to communicate more directly to help the community make get most value out of the hubs. One part of this role is to spearhead the kind of strategy and efforts to help communities use the infra in the most effective way. Another part is to be a more direct contact person. In this way, the person would also help guide future development efforts by aquiring insights about the user experience.
CH: Skills relevant for the role:
Great communicator, good at teaching, enthusiastic.
Capable of leading the development of the role in itself
What do others think about the scope of this role? Does it make sense?
ES: At Sandvik, it would have been helpful to have somebody dedicating time to teaching others how to make use of the service.
Collecting user experience data quickly rather than slowly can be critical.
“They’d be in a unique position to validate the value of certain features”.
Could this role write feature proposals for open source communities?
GD: Could help build trust with communities, by having a more human connection
How can we incorporate this role into our team planning processes?
Feature requests and shaping for our communities
They will need to talk to engineers to know what is easy vs. hard vs. impossible to implement
Could set set up a process to plan/prioritize our next work?
If so, this person could be involved in this discussion
They will have a good perspective on what to prioritize
Does needing a specific datacenter require a dedicated cluster?#
When communities want a hub to run in a particular datacenter, does this mean we must give them their own cluster? (10m)
CH: A user wants a hub in a specific data center, but doesn’t need a dedicated cluster.
SG: This is probably to a large extent related to egress/ingress costs. We will need a new cluster in this case, but should we make it shared or dedicated to them?
SG: Of course, a shared cluster would imply we ensure to avoid bleeding credentials to access relevant resources between communities.
ES/CH/Yuvi: What about cost of egress and billing? It simplifies billing, because we can’t separate costs related to egress traffic.
YP: Networking performance gains to co-locate with data.
CH: If we were launching into separate clusters w/ the multi-cluster spawner, would that create a really huge list of user profiles people had to choose between?
SG: You can restrict the access that individuals have to different clusters, may also be able to restrict visibility
YP: Cluster-spawner would reduce complexity of the spawned clusters
CH: Want to avoid this: shared hub, 20 clusters to select from, 19 is totally irrelevant…
ES: Off topic relevant ide: provide ping times
ES: What are limitations for users of using a central cluster spawner to access a hub?
Meeting facilitator: @choldgraf
Meeting timekeeper: @choldgraf
Meeting recorder: @choldgraf
We might be thinking about this when it comes to the pangeo Binder
Would testing fix this?
In some cases, maybe, but not always
For hubs that want to have full access, testing would not be a fix
Scenario where somebody wants to have full access
A time-boxed workshop where you don’t want to hassle with manually adding a bunch of people
Could email reporting address this?
Having a single grafana that aggregates across the prometheus for each hub would help
Not clear how you’d be able to access
Could RBAC address this?
RBAC maps onto specific permissions in the hub, it’s a bit more complex for arbitrary permissions mapping
About more convenient authorization logic, an issue about a hook: https://github.com/jupyterhub/jupyterhub/issues/3490
Other project ideas
Erik: interested in Helm validation, but wants more consensus on what our helm config structure should be first.
Sarah: Standardize our Terraform structure across providers so that we minimize complexity associated with provider-specific stuff.
Figuring out a workaround for ConfigConnector. Scratch buckets + requestor pays access are two things that’d be enabled by this.
Chris: Help with migrating the Pangeo documentation. Can coordinate with Sarah to figure out which ones
Damian: Improving our staging and testing environments will make it easier to iterate and develop new things - this is also related to the Helm validation stuff that we discussed.