Background
Recently, Langchain and HuggingFace jointly released a new partner package. While Langchain already had a community-maintained HuggingFace integration, this new package is officially supported by HuggingFace as a Langchain partner! Langchain provides a common interface for interacting with various LLMs, while HuggingFace offers inference endpoints for hosted LLMs, including open-source models.
In this blog, I'll share my experience running inference on an open-source model hosted by HuggingFace through this new Langchain library.
TL;DR
If you prefer to try it out yourself, you can clone my repo:
The Experiment
Inference Options via HuggingFace
HuggingFace provides three ways to perform inference:
- Directly via UI: A chat widget is available for each model. You can select a model from the list, such as Meta's LLAMA model.
- (Free) Inference API (Serverless): This option is suitable for minimal testing. It uses HuggingFace's shared infrastructure, so rate limits apply. An access token from your account settings is used as an API key. We will use this option to try the Langchain library (a minimal sketch of calling it directly follows this list).
- (Paid) Inference Endpoints (Dedicated API): Suitable for production use, but since this is an experiment, we will not deploy and use this option.
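To make the serverless option concrete, here is a minimal sketch of calling the free Inference API directly with the huggingface_hub client, before bringing Langchain into the picture. It assumes an access token is exported as the HF_TOKEN environment variable; the repo id is only an illustrative choice of Meta's LLAMA instruct model, and any hosted text-generation model would do.

import os
from huggingface_hub import InferenceClient

# Serverless Inference API: shared, rate-limited infrastructure, authenticated with your access token
client = InferenceClient(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative repo id (assumed)
    token=os.environ["HF_TOKEN"],
)

# Simple text-generation call against the hosted model
print(client.text_generation("What is Langchain?", max_new_tokens=100))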
Langchain_HuggingFace Library
This library exposes two classes to interact with HuggingFace LLMs: HuggingFacePipeline and HuggingFaceEndpoint.
We are interested in HuggingFaceEndpoint because it allows for remote inference. Under the hood, this class uses an InferenceClient. For some models (for example, Meta's LLAMA) you need to agree to the terms on the model's HuggingFace page before you can use them. HuggingFacePipeline, by contrast, requires downloading the model locally, which is not ideal unless you have a specific reason to run it on your own hardware, as the short sketch below illustrates.
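For contrast, here is a rough sketch of the local route with HuggingFacePipeline; the model id is just a small illustrative choice, and calling from_model_id downloads the weights to your machine.

from langchain_huggingface import HuggingFacePipeline

# Downloads the weights locally and runs inference on your own hardware
local_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",  # small illustrative model (assumed); any text-generation model works
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 64},
)

print(local_llm.invoke("HuggingFace is"))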
After instantiating the HuggingFaceEndpoint class, we define some langchain.schema messages. Another important class in this library is ChatHuggingFace, which enriches the prompt with the model-specific special tokens needed for better inference. It also adds model metadata to the response, such as the number of tokens used, providing the uniform response format that Langchain promises.
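As a rough sketch of how these pieces fit together, assuming llm is the HuggingFaceEndpoint instance configured as in the snippet later in this post:

from langchain.schema import HumanMessage, SystemMessage
from langchain_huggingface import ChatHuggingFace

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Explain what an inference endpoint is in one sentence."),
]

# Wraps the endpoint, applies the model-specific special tokens,
# and returns an AIMessage carrying the usual Langchain response metadata
chat = ChatHuggingFace(llm=llm)
ai_message = chat.invoke(messages)
print(ai_message.content)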
Check out the code I wrote for this experiment!
Overall Impressions
The overall experience was not smooth, as the library seems to be a work in progress. Here are some specific issues I encountered:
1. Outdated Docstrings: The class docstrings shown in the IDE are not updated.

2. Incomplete Documentation: Langchain's documentation has not been updated, possibly pending the Langchain v0.2 update. Simply following their announcement will definitely not get it working.
3. Non-functional Parameters: Several parameters either have no effect on the model response or raise errors, as the snippet below shows.
from langchain_huggingface import HuggingFaceEndpoint

# LLAMA_INSTRUCT is the repo id of Meta's LLAMA instruct model on HuggingFace
llm = HuggingFaceEndpoint(
    repo_id=LLAMA_INSTRUCT,  # if endpoint_url is used instead, we also need to provide model_id in ChatHuggingFace
    task="text-generation",
    streaming=True,
    max_new_tokens=1024,  # doesn't seem to have any impact on output length
    model="",  # required field, but it doesn't seem to have any impact on output; only repo_id does
    client=None,
    async_client=None,
    return_full_text=True,
    repetition_penalty=1.1,
    cache=False,
    do_sample=False,
)

Conclusions & Possible Future Work
Their announcement concluded with:
We are committed to making langchain-huggingface better by the day. We will be actively monitoring feedback and issues and working to address them as quickly as possible. We will also be adding new features and functionality and expanding the package to support an even wider range of the community's use cases. We strongly encourage you to try this package and to give your opinion, as it will pave the way for the package's future.
I hope this experiment serves as valuable feedback, and I plan to create an issue on the Langchain repository. This experiment gave me a better understanding of free OSS LLM inference on HuggingFace and the state of its Langchain integration library.
For future experiments, I'm considering following this example to create an agent. Stay tuned for my next blog post!