CrossAsia DH Lunchtalks – AI for the Humanities: A Case of Manchu OCR

Dear users,

On February 3rd at 12:30 pm (CET), we are pleased to host the first session of the CrossAsia DH Lunchtalks 2026. The talk will be given by Dr. Yan Hon Michael Chung and is titled “AI for the Humanities: A Case of Manchu OCR.” Dr. Chung will introduce the development pipeline for creating an OCR model for Manchu-language documents and share his reflections on applying AI to humanities research.

Manchu, today an endangered language, was once the official language of China’s last imperial dynasty, the Qing (1644–1911). The Qing state produced an enormous corpus of Manchu-language documents, many of which have been digitized and made publicly available by archives and libraries worldwide. Despite this abundance of scanned materials, there is still no reliable, publicly accessible optical character recognition (OCR) system for Manchu, posing a major bottleneck for historical research.

This presentation introduces an end-to-end Manchu OCR system developed by fine-tuning a vision–language model (VLM), and uses it as a case study to reflect on the broader challenges of applying AI to humanities research. It identifies three structural constraints that distinguish humanities-oriented AI development from commercial or industrial settings: the scarcity of labeled training data, the unusually high accuracy requirements demanded by scholarly research, and the limited computational resources available to most humanities scholars.

To address these constraints, the project adopts a small-model, data-centric strategy. The OCR model is trained using a combination of large-scale synthetic data and carefully curated historical samples. Specifically, a LLaMA-3.2-11B Vision model is fine-tuned using approximately 60,000 synthetic Manchu images alongside 20,000 Manchu word images extracted from real Qing-era documents. The resulting model achieves up to 96% accuracy on unseen, real-world scanned Manchu sources.

The OCR pipeline is further enhanced through a custom Manchu word detection and segmentation model, combined with a post-processing large language model for typographical correction. Together, these components form a complete, practical Manchu OCR system built with state-of-the-art vision–language and language models. Beyond presenting technical results, this presentation argues that carefully constrained, accuracy-driven AI systems offer a viable and sustainable path for AI research in the humanities.

About the speaker:

Dr. Michael Chung is an Assistant Professor in Digital Humanities at the Hong Kong University of Science and Technology. Chung received his PhD in history from Emory University in 2025, and his BA and MPhil from the Chinese University of Hong Kong in 2012 and 2016 respectively. Chung’s research centers on the early Qing dynasty, with a focus on the transfer of European artillery technology and the formation of the Hanjun Eight Banners. As a digital humanist, Chung is currently developing a Manchu OCR system based on a fine-tuned vision-language model.

 

The lecture will be held in English. If you have any questions, please contact us at ostasienabt@sbb.spk-berlin.de.

The lecture will be streamed and recorded via Webex. You can take part using your browser without having to install any special software. Please click the “To the lecture” button below, follow the “join via browser” link, and enter your name.

You can find the full programme of the CrossAsia DH Lunchtalks 2026 here. Further talks will also be announced on our blog as well as on Mastodon and Bluesky.

 

Yours,

CrossAsia Team

CrossAsia DH Lunchtalks Launching in February 2026

Dear colleagues,

We are delighted to announce that the CrossAsia DH Lunchtalks will return in February 2026.

Originally launched between winter 2023 and spring 2024, the first DH Lunchtalk Series was warmly received by our community. Building on this success, the CrossAsia team and the Max Planck Institute for the History of Science (MPIWG) went on to co-host the international conference “Charting the European D-SEA: Digital Scholarship in East Asian Studies” in Berlin from 8–12 July 2024, bringing together around 120 participants from 19 countries and regions (read more).

In light of this strong engagement and our ongoing commitment to digital scholarship, we are pleased to relaunch the Lunchtalks as an online forum where scholars can share project updates, present new tools and methods, offer methodological insights, and showcase innovative research in Digital Asian Studies.

Between February and June 2026, the DH Lunchtalks will take place monthly. While the 2023–2024 season focused primarily on training in digital tools and platforms, the upcoming series will feature 60-minute lunchtime talks (including Q&A) by distinguished speakers presenting their latest digital research projects. The currently confirmed programme is as follows:

  1. February 3
    Prof. Michael Yan Hon CHUNG (Hong Kong University of Science and Technology)
    AI for Endangered Documentary Archives: Manchu OCR
  2. March 24
    Dr. Franz Xaver Erhard (Leipzig University)
    Getting the Lines Right: Layout Analysis as the Critical First Step for Tibetan Newspaper HTR
  3. April 21
    Prof. ZHAN Beibei (Hunan University)
    Digital Analysis for Confucian Academies in East Asia
  4. May 21
    Dr. CHEN Shih-Pei (Max Planck Institute for the History of Science) & Prof. Mariana Favila-Vázquez (CIESAS–Unidad Ciudad de México)
    Treating a Genre as a Knowledge System: A Digital Research Methodology for Studying Chinese Local Gazetteers
  5. June (TBC)
    Dr. CHOI Donghyeok (Hong Kong Baptist University)
    AI Methods to Construct and Analyze Large-Scale Historical Databases
  6. May or June (TBC)
    Dr. Rafał Jan Felbur (Heidelberg University)
    Born-digital Dictionary of Early Chinese Buddhist Translations

 

All DH Lunchtalks will take place from 12:30 to 13:30 (Central European Time) and will be held online via Webex. Further details for each session, including abstracts and access links, will be announced in advance on the CrossAsia blog. The first talk, by Prof. Michael Yan Hon Chung, will be announced shortly on CrossAsia.

If you have any questions about the DH Lunchtalks, or if you are interested in proposing a future talk and sharing your own digital research, please contact Dr. Jing Hu at jing.hu@sbb.spk-berlin.de.

We look forward to welcoming many of you to the CrossAsia DH Lunchtalks 2026!

 

Yours,

CrossAsia Team

 

Unlock newspaper knowledge with CrossAsia’s AI Explorer: explore and test two new features for finding similar and possibly relevant articles across languages

The defining characteristics of newspapers are timeliness (prompt reporting on current events), periodicity (regular publication), publicity (public dissemination of information accessible to everyone) and universality (broad thematic diversity ranging from politics to culture).

But what happens when we overcome language barriers and connect newspapers and news from different countries and languages? With the CrossAsia Newspaper Explorer, we can use technology to find similar and relevant articles across languages and scripts.

We have added two new AI-powered features to the CrossAsia Newspaper Explorer. The first, “Show results by similarity”, extends the result sets you produce by searching for one or several combined terms in one or more sources. The second, “Cross-language search for similar titles”, starts from one of the actual titles in your result set. Both functions use vector embeddings*, an AI technique that captures the meaning of a text beyond its individual words and across different languages. No worries, you do not need to understand the underlying math. Just be aware of this much: each text is transformed into a vector, a list of several hundred numbers that places the “meanings/concepts” of the text at a certain distance and angle in a multi-dimensional space. Texts are considered “similar” when their vectors lie close together. To calculate this closeness and display it in a 3D space, the data is reduced in complexity.
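For readers who do want a glimpse of the arithmetic, here is a minimal sketch of how the closeness of two embedding vectors can be measured with cosine similarity, which compares the angle between them. This is an illustration only: the three-number vectors are invented toy values rather than model output, and we assume cosine similarity here without claiming that it is the exact measure used inside the Explorer.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (invented values, not produced by any model).
drought_en = [0.9, 0.1, 0.2]   # an English article about drought
drought_zh = [0.8, 0.2, 0.1]   # a Chinese article about drought
football_en = [0.1, 0.9, 0.3]  # an English article on an unrelated topic

same_topic = cosine_similarity(drought_en, drought_zh)
other_topic = cosine_similarity(drought_en, football_en)
print(same_topic > other_topic)  # prints True
```

A value close to 1 means the two vectors point in nearly the same direction, which is what “similar in meaning” amounts to in this model of text.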

We used the stsb-xlm-r-multilingual model (Ollama backend) to prepare the texts for this feature. To display the spatial relations, and for some other features, we use the Embedding Projector.

When selecting the “sources” for your search in the CrossAsia Newspaper Explorer, you will now notice a star icon next to some data sources. This indicates that the source not only has a “word” index but has also been fully converted into embedding vectors and supports the new features (fig. 1).

*Note: Embedding vectors are numerical representations created by AI to understand and compare the meaning of text, even in different languages. For a more extensive explanation please see here: https://www.ibm.com/think/topics/vector-embedding

Fig.1: Source selection showing availability for new AI features.

 

Sounds too abstract? Let’s look at an example.

Every analysis in the ITR Explorer or Newspaper Explorer starts with producing result sets, i.e. searching for terms in sources, and, if needed, combining the result sets with OR, AND, or NOT. Our showcase example uses OR to combine a search for 旱災 (“drought” in Chinese) across selected Chinese and Japanese newspaper sources (with CJK Mapping enabled) with a search for the word “drought” in English newspapers published in China: drought ∪ 旱災 (fig. 2).
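Combining result sets with OR, AND, or NOT can be thought of as set union, intersection, and difference. The following short sketch illustrates this with invented article identifiers; the names and values are hypothetical and do not reflect the Explorer's internal data.

```python
# Hypothetical article identifiers standing in for two result sets.
result_a = {"art-1", "art-2", "art-3"}
result_b = {"art-3", "art-4"}

combined_or = result_a | result_b    # OR: articles found in either result set
combined_and = result_a & result_b   # AND: articles found in both result sets
combined_not = result_a - result_b   # NOT: articles in the first set but not the second

print(sorted(combined_or))   # prints ['art-1', 'art-2', 'art-3', 'art-4']
print(sorted(combined_and))  # prints ['art-3']
print(sorted(combined_not))  # prints ['art-1', 'art-2']
```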

Fig.2: Production of a cross-language result set from English, Japanese and Chinese newspapers

 

“Show results by similarity”

Clicking the icon in the combined result set triggers the “Show results by similarity” function. It loads the AI-based embedding vectors of all articles in the result set into the display and analysis tool Embedding Projector, which shows semantically similar content across languages as a distribution of points at different distances and angles in a 3D space defined by the AI model used.

Fig.3: Combined result set loaded in Embedding Projector with standard settings and PCA projection
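To give an idea of what a PCA projection does, the sketch below finds the single direction along which a set of points varies most and expresses each point as one coordinate along it. It is a simplified illustration under stated assumptions: the points are invented toy values, the function names are hypothetical, and only the first principal component is computed, whereas the 3D view keeps three. It does not reproduce the Embedding Projector's own implementation.

```python
import math

def first_principal_component(points, iterations=100):
    """Estimate the direction of greatest variance by power iteration.

    Expects at least two points of equal dimension. The all-ones start
    vector is a simplification that fails if it happens to be exactly
    orthogonal to the direction sought.
    """
    n, dim = len(points), len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(dim)]
    centered = [[p[i] - means[i] for i in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    direction = [1.0] * dim
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(dim))
                   for i in range(dim)]
        length = math.sqrt(sum(x * x for x in product))
        if length == 0:
            break
        direction = [x / length for x in product]
    return means, direction

def project(points, means, direction):
    """Return each point's coordinate along the given direction."""
    return [sum((p[i] - means[i]) * direction[i] for i in range(len(direction)))
            for p in points]

# Invented toy points in three dimensions that all lie on one line.
toy_points = [[0, 0, 0], [1, 2, 0], [2, 4, 0], [3, 6, 0]]
means, direction = first_principal_component(toy_points)
coordinates = project(toy_points, means, direction)
print(direction)
print(coordinates)
```

Because the toy points lie on a single line, one coordinate per point is enough to describe them without loss. Real embeddings lose some detail when reduced, which is why distances in the 3D view are an approximation.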

 

The Embedding Projector interface consists of three main sections:

  • Left Panel: Shows the name and size of the loaded result set (blue frame) and controls how the data points (dots) are labeled (red frame) and colored (green frame). In the default setting, titles appear on hover, colors reflect the different data sources (src), and PCA is used for projection. Other available options for projection are UMAP and t-SNE. The “?” next to the projection gives an introduction on how to use and interpret it.
  • Center Panel: Displays the interactive embedding viewer. You can zoom, rotate, and explore the data visually. Clicking a dot opens a pop-up box with some basic metadata, a CrossAsia link (in red), and a Provider link (in black, for users with another IP access authentication) leading directly to the article in the provider’s database.
  • Right Panel: When you click a dot/title in the center panel, similar records are highlighted and the right panel lists them with their distance, title, and a direct link to their database. It is also possible to display only data points that match certain metadata criteria, such as containing a certain term or being published in the 1960s (fig. 4).

Fig.4: Filtering the data points by metadata, here those of the 1960s, and showing the pop-up box for a selected article.

 

In the next screenshot (fig.5), the same result set uses UMAP to project the records.

Fig.5: UMAP projection of result set “drought ∪ 旱災”

 

Let’s explore the cluster of records in the upper middle, where blue (English) titles mix with pink and red ones (Chinese, from RMRB and Dagong bao), by drawing a box around that cluster (see fig. 3, lilac-framed icon). The selection suggests that the articles are “similar” because “water management/水利” plays a central role in them.

Fig.6: Exploring one cluster of records in the UMAP projection

 

“Cross-language search for similar titles”

The second new feature of the CrossAsia ITR Newspaper Explorer is an addition to the fifth section of the ITR Explorer interface, “List of matching titles”. It offers the same functions as the one described above, but instead of displaying a pre-defined set of titles, it starts from one specific article within the result set and searches for similar titles across all data sources in which this AI feature is enabled. A click on the star icon next to one of the titles triggers the search and display (fig. 7).

Fig.7: Starting an AI exploration from the list of matching titles

 

Starting from the Chinese article “捷克外長克萊門蒂斯作:紀念蘇捷同盟六週年永遠和蘇聯” (Czech Foreign Minister Clementis: Commemorating the Sixth Anniversary of the Soviet-Czech Alliance, Forever with the Soviet Union), the AI search will also find “similar titles” in languages other than Chinese, such as the English newspaper article “CZECH’S FAREWELL TO SIBERIA” (fig. 8).

Fig.8: The results for a Chinese article also include English articles that are considered similar in “meaning”
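The idea behind such a search can be pictured as ranking all articles from embedding-enabled sources by their similarity to the starting article and keeping the closest ones. The sketch below is a simplified illustration: the article records, field names, and vectors are invented, and cosine similarity is our assumption for the ranking measure, not a description of the Explorer's actual implementation.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical records: a title, a language, whether the source has
# embedding vectors (the star icon), and an invented toy vector.
ARTICLES = [
    {"title": "Article A", "lang": "en", "enabled": True, "vector": [0.6, 0.7, 0.2]},
    {"title": "Article B", "lang": "en", "enabled": True, "vector": [0.1, 0.2, 0.9]},
    {"title": "Article C", "lang": "ja", "enabled": True, "vector": [0.7, 0.5, 0.0]},
    {"title": "Article D", "lang": "zh", "enabled": False, "vector": [0.7, 0.6, 0.1]},
]

def most_similar(query_vector, articles, top_k=2):
    """Rank articles from embedding-enabled sources by similarity to the query."""
    scored = [(cosine(query_vector, a["vector"]), a["title"])
              for a in articles if a["enabled"]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy vector standing in for the starting (Chinese) article.
query = [0.7, 0.6, 0.1]
for score, title in most_similar(query, ARTICLES):
    print(f"{title}: {score:.3f}")
```

Note that Article D is skipped although its vector matches the query exactly, because its source is not marked as embedding-enabled. The language of an article plays no role in the ranking, which is why results can cross languages.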

No tool makes sense without users!

Please share your experiences with the new ITR Newspaper features with us and the community. Have you found interesting, unexpected but useful results using them? Do you have advice for other users on how to make the best use of them? Please share it in the comments to this blog. Thank you!

The new features are, like all CrossAsia Lab tools, open to all users and not confined to those who can access the licensed databases. If you find flaws or errors or have suggestions for improvement, do not hesitate to contact us via the x-asia address or use the comment function in the CrossAsia Forum.