42 Commits

Author SHA1 Message Date
Alex Hancock 7134e89c4b feat: improved UX for tool calls via execute_code (#6205) 2025-12-22 10:42:20 -05:00
Michael Neale d4814042e6 chore: cover code mode with end to end provider tests (#6183) 2025-12-19 12:02:06 +08:00
Jack Amadeo 7ff3adcc5f Clean PR preview sites from gh-pages branch history (#6161) 2025-12-18 16:22:57 -05:00
Jack Amadeo 9fdb0356f0 Disallow subagents with no extensions (#5825) 2025-12-15 12:45:42 -05:00
tlongwell-block a131b08817 refactor: unify subagent and subrecipe tools into single tool (#5893) 2025-12-13 13:50:20 -05:00
Michael Neale 7dd244eff6 chore: avoid accidentally using native tls again (#6086) 2025-12-12 11:35:52 +11:00
Douwe Osinga 5f50198318 feat: @goose in terminal (native terminal support) (#5887) 2025-12-01 17:40:17 +11:00
    Co-authored-by: Bradley Axen <baxen@squareup.com>
    Co-authored-by: Michael Neale <michael.neale@gmail.com>
    Co-authored-by: Douwe Osinga <douwe@squareup.com>
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
David Katz c1c772b267 Add out of context compaction test via error proxy (#5805) 2025-11-21 14:51:01 -05:00
Douwe Osinga f4724cbf23 Comment out the flaky mcp callers (#5827) 2025-11-20 21:21:38 +01:00
    Co-authored-by: Douwe Osinga <douwe@squareup.com>
Salvatore Testa cfdf01567d fix: support Gemini 3's thought signatures (#5806) 2025-11-20 16:28:27 +11:00
    Signed-off-by: Salvatore Testa <sal@withpersona.com>
David Katz 1d8d6a1788 Provider error proxy for simulating various types of errors (#5091) 2025-11-18 17:28:07 -05:00
Michael Neale 2bef034303 feat: trying grok for live test (#5732) 2025-11-17 09:37:43 +11:00
Jack Amadeo d4f66f4855 faster, cheaper (pick two): improve CI workflow and switch to free github runner (#5702) 2025-11-14 12:58:57 -05:00
    Co-authored-by: Douwe Osinga <douwe@block.xyz>
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Jack Amadeo 5110d32142 bump openapi version directly (#5674) 2025-11-11 10:15:42 -05:00
Alex Hancock 7ec3b84ad7 fix: gemini flash -> pro for mcp smoke tests (#5574) 2025-11-06 10:05:18 -05:00
David Katz eb29083a52 Manual compaction test and fix (#5568) 2025-11-06 10:03:48 -05:00
Zane 89f7384d57 add clippy warning for string_slice (#5422) 2025-11-04 17:46:25 -05:00
    Co-authored-by: Douwe Osinga <douwe@squareup.com>
Michael Neale 7511a533d6 we should run this on main and also test open models at least via ope… (#5556) 2025-11-04 09:06:23 +11:00
    adds qwen3-code and GLM 4.6 to test_providers for open model coverage
Alex Hancock 38e7dc8f30 fix: remove qwen3-coder from provider/mcp smoke tests (#5551) 2025-11-03 14:33:49 -05:00
Alex Hancock c1c13716e0 chore(tests/mcp): testing for MCP sampling (#5456) 2025-11-03 12:23:11 -05:00
Amed Rodriguez d9633ff1d9 Change Recipes Test Script (#5457) 2025-10-30 16:00:25 -07:00
Michael Neale b94535b679 testing tetrate with sonnet (#5428) 2025-10-29 11:40:02 +11:00
Amed Rodriguez 4687656487 Add Recipes Test Script (#5420) 2025-10-28 17:17:51 -07:00
Douwe Osinga 6b6c50976c Gemini again (#5390) 2025-10-27 16:41:00 -04:00
    Co-authored-by: Douwe Osinga <douwe@squareup.com>
Will Pfleger 044b227fdb (re)Standardize Session Name Attribute (#5279) 2025-10-24 13:34:08 -04:00
Michael Neale 3c975bb358 live testing script (#5263) 2025-10-21 16:39:58 +11:00
    Co-authored-by: Jack Amadeo <jackamadeo@squareup.com>
Douwe Osinga 64b37339e0 Skip subagents for gemini (#5257) 2025-10-18 17:35:29 -04:00
    Co-authored-by: Douwe Osinga <douwe@squareup.com>
Michael Neale 890393bb68 Revert "Standardize Session Name Attribute" (#5250) 2025-10-18 12:44:30 -04:00
Will Pfleger b8c3508178 Standardize Session Name Attribute (#5085) 2025-10-17 17:05:41 -04:00
Jack Amadeo 757ceb6109 chore: turn clippy on for test code (#4817) 2025-09-26 00:06:07 -04:00
Angie Jones 63f3669cf7 Remove deprecated Claude 3.5 models (#4590) 2025-09-10 14:41:02 -05:00
Jack Amadeo 7c2b40cc21 Clean up langfuse docs and scripts (#4220) 2025-08-20 10:46:31 -04:00
Jack Amadeo dd504741a3 Remove cognitive complexity clippy lint (#4010) 2025-08-11 20:24:37 -04:00
Michael Neale 8f54fa84a5 fix: optimise reading large file content (#3767) 2025-08-06 09:38:52 +10:00
Lifei Zhou 48a38dc034 Chore: apply more clippy rules to prevent from code complexity (#3813) 2025-08-03 20:03:08 +10:00
Prem Pillai f21b9017b8 fix: ensure retry-config and success-criteria are populated in openapi spec (#3575) 2025-07-22 19:39:35 +10:00
Alice Hau be09849128 [feat] goosebenchv2 additions for eval post-processing (#2619) 2025-05-21 15:00:13 -04:00
    Co-authored-by: Alice Hau <ahau@squareup.com>
marcelle 8fbd9eb327 feat: efficient benching (#1921) 2025-04-08 14:43:43 -04:00
    Co-authored-by: Tyler Rockwood <rockwotj@gmail.com>
    Co-authored-by: Kalvin C <kalvinnchau@users.noreply.github.com>
    Co-authored-by: Alice Hau <110418948+ahau-square@users.noreply.github.com>
Alice Hau bb4feacf03 feat: add additional goosebench evals (#1571) 2025-03-10 15:11:44 -04:00
    Co-authored-by: Alice Hau <alice.a.hau@gmail.com>
marcelle 49dee048e4 feat: goose bench framework for functional and regression testing 2025-03-05 21:23:00 -05:00
    Co-authored-by: Zaki Ali <zaki@squareup.com>
Bradley Axen 1c9a7c0b05 feat: V1.0 (#734) 2025-01-24 13:04:43 -08:00
    Co-authored-by: Michael Neale <michael.neale@gmail.com>
    Co-authored-by: Wendy Tang <wendytang@squareup.com>
    Co-authored-by: Jarrod Sibbison <72240382+jsibbison-square@users.noreply.github.com>
    Co-authored-by: Alex Hancock <alex.hancock@example.com>
    Co-authored-by: Alex Hancock <alexhancock@block.xyz>
    Co-authored-by: Lifei Zhou <lifei@squareup.com>
    Co-authored-by: Wes <141185334+wesrblock@users.noreply.github.com>
    Co-authored-by: Max Novich <maksymstepanenko1990@gmail.com>
    Co-authored-by: Zaki Ali <zaki@squareup.com>
    Co-authored-by: Salman Mohammed <smohammed@squareup.com>
    Co-authored-by: Kalvin C <kalvinnchau@users.noreply.github.com>
    Co-authored-by: Alec Thomas <alec@swapoff.org>
    Co-authored-by: lily-de <119957291+lily-de@users.noreply.github.com>
    Co-authored-by: kalvinnchau <kalvin@block.xyz>
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
    Co-authored-by: Rizel Scarlett <rizel@squareup.com>
    Co-authored-by: bwrage <bwrage@squareup.com>
    Co-authored-by: Kalvin Chau <kalvin@squareup.com>
    Co-authored-by: Alice Hau <110418948+ahau-square@users.noreply.github.com>
    Co-authored-by: Alistair Gray <ajgray@stripe.com>
    Co-authored-by: Nahiyan Khan <nahiyan.khan@gmail.com>
    Co-authored-by: Alex Hancock <alexhancock@squareup.com>
    Co-authored-by: Nahiyan Khan <nahiyan@squareup.com>
    Co-authored-by: marcelle <1852848+laanak08@users.noreply.github.com>
    Co-authored-by: Yingjie He <yingjiehe@block.xyz>
    Co-authored-by: Yingjie He <yingjiehe@squareup.com>
    Co-authored-by: Lily Delalande <ldelalande@block.xyz>
    Co-authored-by: Adewale Abati <acekyd01@gmail.com>
    Co-authored-by: Ebony Louis <ebony774@gmail.com>
    Co-authored-by: Angie Jones <jones.angie@gmail.com>
    Co-authored-by: Ebony Louis <55366651+EbonyLouis@users.noreply.github.com>
Salman Mohammed 8cf7b9f26c refactor: move langfuse wrapper to a module in exchange instead of a package (#138) 2024-10-16 09:30:13 -04:00
    Co-authored-by: Alice Hau <ahau@squareup.com>