-
Notifications
You must be signed in to change notification settings - Fork 2.2k
[INS-206] Store Gitlab Project ID in secret location metadata #4601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[INS-206] Store Gitlab Project ID in secret location metadata #4601
Conversation
amanfcp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! A few suggestions
pkg/sources/gitlab/gitlab.go
Outdated
| // Cache project metadata if available. | ||
| if gitUnit, ok := unit.(git.SourceUnit); ok && gitUnit.Metadata != nil { | ||
| if gitUnit.Metadata["project_id"] != "" { | ||
| project := &project{} | ||
| projectId, err := strconv.Atoi(gitUnit.Metadata["project_id"]) | ||
| if err != nil { | ||
| ctx.Logger().Error(err, "could not convert project_id metadata to int", "project_id", gitUnit.Metadata["project_id"]) | ||
| } else { | ||
| project.id = projectId | ||
| } | ||
| project.name = gitUnit.Metadata["project_name"] | ||
| project.owner = gitUnit.Metadata["project_owner"] | ||
| s.repoToProjCache.set(repoURL, project) | ||
| } | ||
|
|
||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would recommend early return flow here. It's kind of difficult to read 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, totally agree. I'll make the changes
pkg/sources/gitlab/gitlab.go
Outdated
| if proj.Owner != nil { | ||
| metadata["project_owner"] = proj.Owner.Email | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're missing the username check here as handled in gitlabProjectToCacheProject
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, thanks
…data' of mustansir:mustansir14/trufflehog into INS-206-Store-GitLab-project-ID-in-secret-location-metadata
rosecodym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some scanning pipeline machinery that cannot yet support this, so this PR won't work as-is. I'm requesting changes until I (or maybe @mcastorina) has some time to go through it and figure out whether "won't work" actually means there will be problems, or it just won't work.
rosecodym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty straightforward, but I have some notes and questions.
pkg/sources/gitlab/gitlab.go
Outdated
| } | ||
| // extract project path from repo URL | ||
| // https://gitlab.com/testermctestface/testy.git => testermctestface/testy | ||
| repoPath := strings.TrimSuffix(strings.Join(strings.Split(repoUrl, "/")[len(strings.Split(repoUrl, "/"))-2:], "/"), ".git") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider properly parsing this as a URL instead of doing a simple string split?
Relatedly, this looks like it will panic if the repo URL is sufficiently malformed. Even if string splitting is the best solution, we should avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! You're right, parsing it as a URL would make it much more reliable. I'll make the changes.
pkg/sources/gitlab/gitlab.go
Outdated
| } | ||
|
|
||
| // Clear the repo to project cache when done. | ||
| defer s.repoToProjCache.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I respect your concern for cache size, I don't believe clearing this cache is necessary. Source structures don't survive past a single scan, so clearing them at the end of the scan doesn't reclaim any memory that wouldn't be soon reclaimed anyway, and introduces the possibility of race conditions in the case that the git-parsing code sends work to separate goroutines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repoToProjCache uses a mutex internally so there wouldn't be a possibility of race conditions, but yeah if a Source is always used for a single scan then freeing up the memory doesn't make sense. I'll remove that.
pkg/sources/gitlab/gitlab.go
Outdated
| type project struct { | ||
| id int | ||
| name string | ||
| owner string | ||
| } | ||
|
|
||
| type repoToProjectCache struct { | ||
| sync.RWMutex | ||
|
|
||
| cache map[string]*project | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) get(repo string) (*project, bool) { | ||
| r.RLock() | ||
| defer r.RUnlock() | ||
| proj, ok := r.cache[repo] | ||
| return proj, ok | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) set(repo string, proj *project) { | ||
| r.Lock() | ||
| defer r.Unlock() | ||
|
|
||
| r.cache[repo] = proj | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) del(repo string) { | ||
| r.Lock() | ||
| defer r.Unlock() | ||
|
|
||
| delete(r.cache, repo) | ||
| } | ||
|
|
||
| func (r *repoToProjectCache) clear() { | ||
| r.Lock() | ||
| defer r.Unlock() | ||
|
|
||
| clear(r.cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of moving this to a separate file (still within this package)?
| // ensure project details for specified repos are cached | ||
| // this is required to populate metadata during chunking | ||
| for _, repo := range repos { | ||
| s.ensureProjectInCache(ctx, repo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a few passes back and forth to figure out why sometimes you cache projects using ensureProjectInCache, and sometimes you directly cache the the value returned from gitlabProjectToCacheProject. I eventually got there, but laying out the call tree a bit differently would definitely have helped me understand what was going on much more quickly. What I realized I was looking for was a single "entry point" into caching, e.g.:
func (s *Source) cacheProject(repoUrl string, project *gitlab.Project)
When a gitlab.Project is available, this could be used directly. When only a repo URL is available, it could be transformed into a gitlab.Project using a second helper function:
func (s *Source) getGitlabProject(repoURL string) *gitlab.Project
This way, every function call that caches something looks the "same," and the function that acquires new (necessary) information for caching has a clear name. (ensure is a suboptimal function verb, because it's vague.)
What do you think of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensureProjectInCache is a useful function because it does multiple things:
- It first checks if the project is in cache
- If not, it fetches the project from the API
- It then stores the project in cache
This functionality is required in a couple of places which is why I grouped it into a single function. Breaking it down would simply mean making two separate calls in the same order at two separate places.
What I agree with is that we create the two methods cacheProject and getGitlabProject, and use it inside ensureProjectInCache. So when we're enumerating projects, we directly call cacheProject, but in the case of configured repos and ChunkUnit, we use ensureProjectInCache. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone ahead and made these changes. Let me know if you think otherwise and I'll revert.
i misread the pr when i requested changes
rosecodym
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm bummed out that the git processing code requires this sort of contortion, but it looks like you've done it pretty cleanly. Nice work!
Description:
This PR adds the changes to include gitlab project details like project ID, project name and project owner name to the metadata of a chunk.
Achieving this wasn't straightforward as the chunks are generated by the
gitsource and it calls an injectedSourceMetadataFuncto create the metadata. The signature of this function obviously does not include the gitlab project ID.The solution implemented here is to maintain a
repoToProjectCachemap that stores the project details for a repo. This cache is populated as we enumerate the repos and the callback function uses this cache to populate the project details.The only concerning part of this solution is memory usage, so just to put in perspective how much this impacts that, I found a public organization with ~3000 projects and ran benchmarks for before and after making the changes. (I know this isn't anywhere close to the largest organizations we've come across, but this is the biggest one I could find that was public)
Results before making changes:
Total memory usage: 43234992 bytes (43.23 MBs)
Results after this change:
Total memory usage: 44666216 bytes (44.66 MBs)
It's important to note that this doesn't really perform a full scan (I tried doing that first but it would take hours), so I tweaked the code to expose the callback function and directly called that after enumerating the repos. This is the code used for benchmarking:
Checklist:
make test-community)?make lintthis requires golangci-lint)?