Automating Daily arXiv Paper Summaries and Slack Notifications

wufei123 · 2025-02-15 · python

This Python script automates fetching daily arXiv papers, generating summaries with Google Gemini, and posting them to a Slack channel.

Automating Daily arXiv Paper Summaries and Slack Notifications

This script retrieves papers from arXiv, summarizes them using generative AI (specifically, Google Gemini), and posts the summaries to a Slack channel.

I. Python Code:

import datetime
import logging
import os
import time

import arxiv
import google.generativeai as genai
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

# Configuration (best practice to use environment variables for sensitive data)
PAPER_TYPES = ["cs.ai", "cs.cy", "cs.ma"]
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
GEMINI_MODEL = "gemini-2.0-flash"
SLACK_BOT_TOKEN = os.environ.get("SLACK_BOT_TOKEN")
SLACK_CHANNEL = os.environ.get("SLACK_CHANNEL")
MAX_RESULTS = 30

# Logging setup (highly recommended for debugging and monitoring)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def fetch_arxiv_papers(max_results: int = MAX_RESULTS) -> list:
    """Fetches recent arXiv papers in the configured categories."""
    query = " OR ".join([f"cat:{paper_type}" for paper_type in PAPER_TYPES])
    client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending,
    )
    papers = list(client.results(search))

    if not papers:
        logger.warning("No papers found.")
        return []

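    # Keep only papers published within 24 hours of the newest result
    # (arXiv announces papers in daily batches, so this approximates "today's" papers).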
    latest_published = papers[0].published
    threshold = latest_published - datetime.timedelta(hours=24)
    filtered_papers = [paper for paper in papers if paper.published >= threshold]

    return [
        {
            "title": paper.title,
            "summary": paper.summary,
            "pdf_url": paper.pdf_url,
            "published": paper.published,
        } for paper in filtered_papers
    ]


def summarize_paper(abstract_text: str) -> str:
    """Generates a summary of the paper abstract using Google Gemini."""
    try:
        genai.configure(api_key=GEMINI_API_KEY)
        model = genai.GenerativeModel(GEMINI_MODEL)
        prompt = (
            "Summarize the following paper abstract concisely (under 300 characters) for beginners, "
            "including significance and results. Output only the summary.\n"
            "---\n\n"
            f"{abstract_text}"
        )
        response = model.generate_content(prompt)
        return response.text.strip()
    except Exception as e:
        logger.error(f"Error summarizing paper: {e}")
        return "Error generating summary."


def post_to_slack(papers: list) -> None:
    """Posts the paper summaries to the specified Slack channel."""
    if not papers:
        return

    client = WebClient(token=SLACK_BOT_TOKEN)
    messages = []
    for i, paper in enumerate(papers, 1):
        summary = summarize_paper(paper["summary"])  # Summarize here, not in the handler
        time.sleep(1)  # Small delay between Gemini calls to stay under rate limits
        message = (
            f"{i}. *{paper['title']}*\n\n"
            f"{summary}\n\n"
            f"PDF: {paper['pdf_url']}\n"
            f"Published: {paper['published']}\n"
            f"────────────────────────"
        )
        messages.append(message)

    all_messages = "\n".join(messages)

    try:
        result = client.chat_postMessage(channel=SLACK_CHANNEL, text=all_messages)
        logger.info(f"Slack message sent successfully: {result}")
    except SlackApiError as e:
        logger.error(f"Error posting to Slack: {e}")


def lambda_handler(event, context):
    """AWS Lambda handler function."""
    papers = fetch_arxiv_papers()
    post_to_slack(papers)
    return {
        'statusCode': 200,
        'body': "Successfully processed arXiv papers and posted to Slack."
    }
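
For a quick local smoke test before deploying, the handler can be invoked directly. This is a minimal sketch, assuming GEMINI_API_KEY, SLACK_BOT_TOKEN, and SLACK_CHANNEL are already exported in your shell:

if __name__ == "__main__":
    # Local smoke test: runs the same code path Lambda would invoke.
    # The event and context arguments are unused by lambda_handler.
    print(lambda_handler(event={}, context=None))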

II. Local Setup and Deployment to AWS Lambda:

  1. Environment Setup: Use pyenv to manage Python versions. Install Python 3.12.
  2. Install Libraries: AWS Lambda layers for Python expect packages under a top-level python/ directory, so install the required libraries into a folder named python:
    pip install arxiv google-generativeai slack_sdk -t python
  3. Create Zip File: Zip the python folder so the directory itself is preserved in the archive:
    zip -r lambda_layer.zip python
  4. Create AWS Lambda Layer: Upload lambda_layer.zip as a new layer in AWS Lambda. Set architecture to x86_64 and runtime to Python 3.12.
  5. Create AWS Lambda Function: Upload the modified Python code (above) to a new Lambda function. Configure the function to use the created layer. Set environment variables (GEMINI_API_KEY, SLACK_BOT_TOKEN, SLACK_CHANNEL).
  6. Schedule with AWS EventBridge: Create an EventBridge rule with a cron expression (e.g., cron(30 6 * * ? *) for 6:30 AM UTC daily) and set the Lambda function as the target; a boto3 sketch of this step follows the list.
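
Step 6 can also be scripted with boto3 instead of the console. The sketch below is illustrative only; the rule name, function name, and the FUNCTION_ARN placeholder are assumptions you would replace with your own values:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:arxiv-daily-digest"  # placeholder

# Create (or update) the daily schedule: 6:30 AM UTC every day.
rule = events.put_rule(
    Name="arxiv-daily-digest-schedule",
    ScheduleExpression="cron(30 6 * * ? *)",
    State="ENABLED",
)

# Allow EventBridge to invoke the Lambda function.
lambda_client.add_permission(
    FunctionName="arxiv-daily-digest",
    StatementId="allow-eventbridge-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the Lambda function.
events.put_targets(
    Rule="arxiv-daily-digest-schedule",
    Targets=[{"Id": "arxiv-daily-digest", "Arn": FUNCTION_ARN}],
)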

III. Important Considerations:

  • Error Handling: The code wraps external API calls in try...except blocks and logs failures, which is crucial for reliable unattended operation.
  • Rate Limiting: Be mindful of API rate limits for both arXiv and Gemini. The code includes a small delay (time.sleep(1)) between Gemini calls, but heavier use may need a more deliberate strategy; see the backoff sketch after this list.
  • Security: Never hardcode API keys directly in your code. Always use environment variables.
  • Logging: Comprehensive logging is essential for debugging and monitoring the function's execution.
  • Testing: Thoroughly test your code locally before deploying it to AWS Lambda.
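
As one possible rate-limiting strategy, the model.generate_content call in summarize_paper could be wrapped in a retry loop with exponential backoff. This is a minimal sketch that reuses the script's time import and logger; the retry count and delays are arbitrary placeholders:

def generate_with_backoff(model, prompt: str, max_retries: int = 3, base_delay: float = 2.0):
    """Call Gemini with exponential backoff on transient failures."""
    for attempt in range(max_retries):
        try:
            return model.generate_content(prompt)
        except Exception as e:  # quota or transient transport errors
            wait = base_delay * (2 ** attempt)
            logger.warning(f"Gemini call failed ({e}); retrying in {wait:.0f}s")
            time.sleep(wait)
    raise RuntimeError(f"Gemini call failed after {max_retries} retries")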

This setup provides a robust, secure, and well-documented solution. Remember to set the environment variables with your actual Gemini API key, Slack bot token, and Slack channel ID.
